Chief AI Officer Program at Costello College of Business (GMU)
Vadim Sokolov
George Mason University
today
Robots and Automatic Machines Were Generally Very Inventive: Al-Jazari (XII Century)
Hesdin Castle (Robert II of Artois), Leonardo’s robot…
Jaquet-Droz automata (XVIII century):
Logic machine of Ramon Llull (XIII-XIV centuries)
Old AI
If rain outside, then take umbrella
This rule cannot be learned from data. It does not allow inference. Cannot say anything about rain outside if I see an umbrella.
New AI
Probability of taking umbrella, given there is rain
Conditional probability rule can be learned from data. Allows for inference. We can calculate the probability of rain outside if we see an umbrella.
Definition:
The computer program learns as the data is accumulating relative to a certain problem class \(T\) and the target function of \(P\) if the quality of solving these problems (relative to \(P\)) improves with gaining new experience.
There are no correct answers, only data, e.g. clustering:
In shadows of data, uncertainty reigns,
Bayesian whispers, where knowledge remains.
With prior beliefs, we start our quest,
Updating with evidence, we strive for the best.
A dance of the models, predictions unfold,
Inferences drawn, from the new and the old.
Through probabilities, we find our way,
In the world of AI, it’s the Bayesian sway.
So gather your data, let prior thoughts flow,
In the realm of the unknown, let your insights grow.
For in this approach, with each little clue,
We weave understanding, both rich and true.
A humorous and illustrative scene of a hockey player sitting on a bench in full gear, holding a hockey stick in one hand and a whiteboard marker in th
Old AI: Deep Blue (1997) vs. Garry Kasparov
Alpha GO vs Lee Sedol: Move 37 by AlphaGo in Game Two
Probability lets us talk efficiently about things that we are uncertain about.
All these involve estimating or predicting unknowns!!
Random Variables are numbers that we are not sure about. There’s a list of potential outcomes. We assign probabilities to each outcome.
Example: Suppose that we are about to toss two coins. Let \(X\) denote the number of heads. We call \(X\) the random variable that stands for the potential outcome.
Probability is a language designed to help us communicate about uncertainty. We assign a number between \(0\) and \(1\) measuring how likely that event is to occur. It’s immensely useful, and there’s only a few basic rules.
We describe the behavior of random variables with a Probability Distribution
Example: Suppose we are about to toss two coins. Let \(X\) denote the number of heads.
\[X = \left\{ \begin{array}{ll} 0 \text{ with prob. } 1/4\\ 1 \text{ with prob. } 1/2\\ 2 \text{ with prob. } 1/4 \end{array} \right.\]
\(X\) is called a Discrete Random Variable
Question: What is \(P(X=0)\)? How about \(P(X \geq 1)\)?
“happiness index” as a function of salary.
| Salary (\(X\)) | Happiness (\(Y\)): 0 (low) | 1 (medium) | 2 (high) |
|---|---|---|---|
| low 0 | 0.03 | 0.12 | 0.07 |
| medium 1 | 0.02 | 0.13 | 0.11 |
| high 2 | 0.01 | 0.13 | 0.14 |
| very high 3 | 0.01 | 0.09 | 0.14 |
Is \(P(Y=2 \mid X=3) > P(Y=2)\)?
The computation of \(P(x \mid y)\) from \(P(x)\) and \(P(y \mid x)\) is called Bayes theorem: \[ P(x \mid y) = \frac{P(y,x)}{P(y)} = \frac{P(y\mid x)p(x)}{p(y)} \]
This shows now the conditional distribution is related to the joint and marginal distributions.
You’ll be given all the quantities on the r.h.s.
Key fact: \(P(x \mid y)\) is generally different from \(P(y \mid x)\)!
Example: Most people would agree
\[\begin{align*} Pr & \left ( Practice \; hard \mid Play \; in \; NBA \right ) \approx 1\\ Pr & \left ( Play \; in \; NBA \mid Practice \; hard \right ) \approx 0 \end{align*}\]
The main reason for the difference is that \(P( Play \; in \; NBA ) \approx 0\).
Two random variable \(X\) and \(Y\) are independent if \[ P(Y = y \mid X = x) = P (Y = y) \] for all possible \(x\) and \(y\) values. Knowing \(X=x\) tells you nothing about \(Y\)!
Example: Tossing a coin twice. What’s the probability of getting \(H\) in the second toss given we saw a \(T\) in the first one?
Sally Clark was accused and convicted of killing her two children
They could have both died of SIDS.
The chance of a family which are non-smokers and over 25 having a SIDS death is around 1 in 8,500.
The chance of a family which has already had a SIDS death having a second is around 1 in 100.
The chance of a mother killing her two children is around 1 in 1,000,000.
The \(\frac{1}{100}\) comes from taking into account genetics.
\[ P \left( \mathrm{both} \; \; \mathrm{SIDS} \right) = (1/8500) (1/8500) = (1/73,000,000) \]
\[ \frac{p(I|E)}{p(G|E)} = \frac{P( E \cap I)}{P( E \cap G)} \] \(P( E \cap I) = P(E|I )P(I)\) needs discussion of \(p(I)\).
The expected value of a random variable is simply a weighted average of the possible values X can assume.
The weights are the probabilities of occurrence of those values.
\[E(X) = \sum_x xP(X=x)\]
With \(n\) equally likely outcomes with values \(x_1, \ldots, x_n\), \(P(X = x_i) = 1/n\)
\[E(X) = \frac{x_1+x_2+\ldots+x_n}{n}\]
\[E(X) = \frac{1}{37}\times 36 + \frac{36}{37}\times 0 = 0.97\]
\[E(X) = \frac{18}{37}\times 2 + \frac{19}{37}\times 0 = 0.97\]
Casino is guaranteed to make money in the long run!
The variance is calculated as
\[Var(X) = E\left((X - E(X))^2\right)\]
A simpler calculation is \(Var(X) = E(X^2) - E(X)^2\).
The standard deviation is the square-root of variance.
\[sd(X) = \sqrt{Var(X)}\]
\[Var(X) = \frac{1}{37}\times (36 - 0.97)^2 + \frac{36}{37}\times (0 - 0.97)^2 = 34\]
\[Var(X) = \frac{18}{37}\times (2 - 0.97)^2+ \frac{19}{37}\times (0- 0.97)^2 = 1\]
If your goal is to spend as much time as possible in the casino (free drinks): place small bets on black/red
Tortoise and Hare are selling cars. Probability distributions, means and variances for \(X\), the number of cars sold
| 0 | 1 | 2 | 3 | Mean | Variance | sd | |
|---|---|---|---|---|---|---|---|
| cars sold | \(X\) | \(E(X)\) | \(Var(X)\) | \(\sqrt{Var(X)}\) | |||
| Tortoise | 0 | 0.5 | 0.5 | 0 | 1.5 | 0.25 | 0.5 |
| Hare | 0.5 | 0 | 0 | 0.5 | 1.5 | 2.25 | 1.5 |
Let’s do Tortoise expectations and variances
The Tortoise \[\begin{align*} E(T) &= (1/2)(1) + (1/2)(2) = 1.5 \\ Var(T) &= E(T^2) - E(T)^2 \\ &= (1/2)(1)^2 + (1/2)(2)^2 - (1.5)^2 = 0.25 \end{align*}\]
Now the Hare’s \[\begin{align*} E(H) &= (1/2)(0) + (1/2)(3) = 1.5 \\ Var(H) &= (1/2)(0)^2 + (1/2)(3)^2- (1.5)^2 = 2.25 \end{align*}\]
What do these tell us above the long run behavior?
Two key properties:
Let \(a, b\) be given constants
where \(Cov(X,Y)\) is the covariance between random variables.
What about Tortoise and Hare? We need to know \(Cov(\text{Tortoise, Hare})\). Let’s take \(Cov(T,H) = -1\) and see what happens
Suppose \(a = \frac{1}{2}, b= \frac{1}{2}\) Expectation and Variance
\[\begin{align*} E\left(\frac{1}{2} T + \frac{1}{2} H\right) &= \frac{1}{2} E(T) + \frac{1}{2} E(H) = \frac{1}{2} \times 1.5 + \frac{1}{2} \times 1.5 = 1.5 \\ Var\left(\frac{1}{2} T + \frac{1}{2} H\right) &= \frac{1}{4} 0.25 + \frac{1}{4} 2.25 - 2 \frac{1}{2} \frac{1}{2} = 0.625 - 0.5 = 0.125 \end{align*}\]
Much lower!
Many Business Applications!! Suggestions vs Search….
Alice is a 40-year-old women, what is the chance that she really has breast cancer when she gets positive mammogram result, given the conditions:
The posterior probability \(P(\text{cancer} \mid \text{positive mammogram})\)?
Of 1000 cases:
Conditional probability is how AI systems express judgments in a way that reflects their partial knowledge.
Personalization runs on conditional probabilities, all of which must be estimated from massive data sets in which you are the conditioning event.
Many Business Applications!! Suggestions vs Search, ….
Will a subscriber like Saving Private Ryan, given that he or she liked the HBO series Band of Brothers?
Both are epic dramas about the Normandy invasion and its aftermath.
100 people in your database, and every one of them has seen both films.
Their viewing histories come in the form of a big “ratings matrix”.
| Liked Band of Brothers | Didn’t like it | |
|---|---|---|
| Liked Saving Private Ryan | 56 subscribers | 6 subscribers |
| Didn’t like it | 14 subscribers | 24 subscribers |
\[P(\text{likes Saving Private Ryan} \mid \text{likes Band of Brothers})=\frac{56}{56+14}=80\%\]
But real problem is much more complicated:
The solution to all three issues is careful modeling.
The fundamental equation is: \[\text{Predicted Rating} =\text{Overall Average} + \text{Film Offset} + \text{User Offset} + \text{User-Film Interaction}\]
These three terms provide a baseline for a given user/film pair:
Why Should Executives Care?
| Business Question | Distribution |
|---|---|
| Will the customer buy? | Binomial |
| How many orders today? | Poisson |
| What’s the forecast error? | Normal |
Choosing the right distribution is the first step in building a reliable model. Wrong distribution = wrong predictions!
Models the number of successes in \(n\) independent trials, each with probability \(p\)
\[P(X=k) = \binom{n}{k} p^k(1-p)^{n-k}\]
Key Parameters:
Examples: A/B test conversions, click-through rates, quality defects
The Patriots won 19 out of 25 coin tosses in 2014-15. How likely?
The “Law of Large Numbers” Perspective:
With 32 NFL teams over 20+ years, some team will have a suspicious streak!
Key insight: Probability of Patriots specifically = 0.5%. But probability that some team has a streak ≈ much higher!
Business lesson: When auditing for fraud or anomalies:
Looking at enough data, you’ll always find something “unusual”
How many goals will a team score? Historical EPL data:
Each row = one match with final scores.
The Business Problem:
Sports betting: $200+ billion industry.
Our approach:
Who uses this? FiveThirtyEight, ESPN, DraftKings, Betfair, team analytics
A key signature of Poisson data: the mean equals the variance.
Model Diagnostics: Mean vs Variance
| Relationship | Suggests |
|---|---|
| Variance ≈ Mean | Poisson ✓ |
| Variance > Mean | Overdispersion (Negative Binomial) |
| Variance < Mean | Underdispersion (rare) |
Other Poisson Applications:
Poisson is the “go-to” for count data!
#| code-fold: true
#| code-summary: "Show R code"
#| fig-height: 5.5
goals <- c(epl$home_score, epl$guest_score)
lambda <- mean(goals)
x <- 0:8
observed <- table(factor(goals, levels = x)) / length(goals)
expected <- dpois(x, lambda = lambda)
barplot(rbind(observed, expected), beside = TRUE,
names.arg = x, col = c("steelblue", "coral"),
xlab = "Goals Scored", ylab = "Proportion",
legend.text = c("Observed", "Poisson Model"))Model Validation:
The Poisson model (coral bars) fits the observed data (blue bars) remarkably well!
What this tells us:
Slight discrepancy at 0 goals: Real matches have slightly fewer 0-0 draws than Poisson predicts (teams try harder when level!)
Models count of random events: goals, arrivals, defects, clicks
\[P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}\]
Business Applications:
If events are rare and independent, Poisson is your model!
A single \(\lambda\) for all teams is too simple. Better model:
\[\lambda_{ij} = \text{Attack}_i \times \text{Defense}_j \times \text{HomeAdvantage}\]
This is how real sports analytics works:
Same framework applies to:
To predict a specific match, we estimate each team’s scoring rate:
For Arsenal vs Liverpool at home, we estimate Arsenal will score about 1.8 goals on average. Liverpool’s away \(\lambda\) would be calculated similarly.
#| echo: true
#| code-fold: true
#| code-summary: "Show R code"
# Simple estimate: average goals scored and conceded
arsenal_attack <- mean(epl$home_score[epl$home_team_name == "Arsenal"])
liverpool_defense <- mean(epl$home_score[epl$away_team_name == "Liverpool"])
league_avg <- mean(goals)
# Arsenal's expected goals vs Liverpool (simplified)
lambda_arsenal <- arsenal_attack * (liverpool_defense / league_avg)
lambda_arsenalOnce we have \(\lambda\) for each team, we can simulate the match thousands of times.
For Arsenal (\(\lambda=1.8\)) vs Liverpool (\(\lambda=1.5\)), running 10,000 simulations gives:
This is how betting companies set their odds!
#| echo: true
#| code-fold: true
#| code-summary: "Show R code"
set.seed(42)
n_sims <- 10000
# Simulate Arsenal vs Liverpool
arsenal_goals <- rpois(n_sims, lambda = 1.8) # λ for Arsenal
liverpool_goals <- rpois(n_sims, lambda = 1.5) # λ for Liverpool
# Match outcomes
c(Arsenal_Win = mean(arsenal_goals > liverpool_goals),
Draw = mean(arsenal_goals == liverpool_goals),
Liverpool_Win = mean(arsenal_goals < liverpool_goals))Each simulation draws random goals from Poisson distributions
This is how FiveThirtyEight and bookmakers build their models!
Monte Carlo Applications:
When math is too hard, simulate!
The most important theorem in statistics:
The average of many independent random events tends toward a Normal distribution, regardless of the original distribution.
Why it matters: Stock returns, measurement errors, test scores — all tend to be Normal because they’re sums of many small effects.
Practical Implications:
Rule of thumb: Sample size ≥ 30 usually sufficient for CLT to kick in
This is why the Normal distribution is everywhere!
Suppose the true vote share in Michigan is 51%. What happens when we poll voters?
#| fig-height: 4.5
#| code-fold: true
#| code-summary: "Show R code"
#| layout-ncol: 3
set.seed(42)
true_p <- 0.51
# Poll of 10 voters
hist(replicate(1000, mean(rbinom(10, 1, true_p))), breaks = 20,
main = "Poll: 10 Voters", xlab = "Vote Share", col = "steelblue",
freq = FALSE, xlim = c(0.2, 0.8))
abline(v = true_p, col = "red", lwd = 2, lty = 2)
# Poll of 100 voters
hist(replicate(1000, mean(rbinom(100, 1, true_p))), breaks = 20,
main = "Poll: 100 Voters", xlab = "Vote Share", col = "steelblue",
freq = FALSE, xlim = c(0.2, 0.8))
abline(v = true_p, col = "red", lwd = 2, lty = 2)
# Poll of 1000 voters
hist(replicate(1000, mean(rbinom(1000, 1, true_p))), breaks = 20,
main = "Poll: 1000 Voters", xlab = "Vote Share", col = "steelblue",
freq = FALSE, xlim = c(0.2, 0.8))
abline(v = true_p, col = "red", lwd = 2, lty = 2)Larger samples → tighter Normal distribution around the true value (red line)
The “bell curve” — the most important distribution in statistics
The 68-95-99.7 Rule:
Why it’s everywhere: Central Limit Theorem guarantees that averages of many random events become Normal
Applications: Quality control, financial risk, test scores, measurement error
#| echo: false
#| fig-height: 5
x <- seq(-4, 4, length = 200)
plot(x, dnorm(x), type = "l", lwd = 3, col = "steelblue",
xlab = "Standard Deviations from Mean", ylab = "Density")
polygon(c(x[x >= -1 & x <= 1], 1, -1),
c(dnorm(x[x >= -1 & x <= 1]), 0, 0), col = rgb(0.3, 0.5, 0.7, 0.3))
abline(v = c(-2, -1, 1, 2), lty = 2, col = "gray")
text(0, 0.15, "68%", cex = 1.2)Male heights follow a Normal distribution: mean = 70 inches, sd = 3 inches
R Functions for Normal Distribution:
| Function | Purpose | Example |
|---|---|---|
pnorm() |
Probability ≤ x | P(height ≤ 73) |
qnorm() |
Find percentile | 95th percentile |
dnorm() |
Density at x | Height of curve |
rnorm() |
Random samples | Simulate data |
Business Applications:
How extreme was the October 1987 crash of -21.76%?
Conclusion: The model is wrong — stock returns have “fat tails.” Banks using Normal-based VaR dramatically underestimate risk.
The Problem with Normal Assumptions:
Stock returns have more extreme events than the Normal distribution predicts.
| Event | Normal Probability | Actually Happened |
|---|---|---|
| 1987 Crash (-22%) | 1 in \(10^{160}\) | Yes |
| 2008 Crisis | “Impossible” | Yes |
| 2020 COVID Crash | “Impossible” | Yes |
Implications for Risk Management:
Finding the relationship between variables
\[y = \beta_0 + \beta_1 x + \epsilon\]
Goal: Minimize sum of squared prediction errors
Business Questions Regression Answers:
Regression quantifies relationships and enables prediction.
Using Saratoga County housing data, we fit a model:
Price = f(Living Area)
A 2,000 sq ft house: $13K + (2000 × $113) = $239,000
Interpreting Coefficients:
| Coefficient | Meaning |
|---|---|
| Intercept ($13K) | Value of land without house |
| Slope ($113/sqft) | Price increase per sqft |
Making Predictions:
\[\text{Price} = 13,439 + 113 \times \text{SqFt}\]
| House Size | Predicted Price |
|---|---|
| 1,500 sqft | $183,000 |
| 2,500 sqft | $296,000 |
| 3,500 sqft | $409,000 |
What the plot shows:
Key observations:
The line minimizes the sum of squared vertical distances
The Capital Asset Pricing Model (CAPM) asks: Does a stock follow the market or beat it?
\[\text{Google Return} = \alpha + \beta \times \text{Market Return}\]
#| echo: false
#| fig-height: 5.5
plot(spy, goog, pch = 20, col = rgb(0.3, 0.5, 0.7, 0.5), cex = 0.8,
xlab = "S&P 500 Daily Return", ylab = "Google Daily Return",
main = "Google vs Market (2017-2023)")
abline(model, col = "red", lwd = 3)
abline(h = 0, v = 0, lty = 2, col = "gray")
legend("topleft", legend = bquote(beta == .(round(coef(model)[2], 2))),
col = "red", lwd = 3, bty = "n")Our Findings:
| Beta | Interpretation |
|---|---|
| \(\beta < 1\) | Less volatile (utilities, healthcare) |
| \(\beta = 1\) | Moves with market (index funds) |
| \(\beta > 1\) | More volatile (tech, small caps) |
Conclusion: Google tracked the market without consistent alpha in 2017-2023. High beta = higher risk, potentially higher reward.
How does advertising affect price sensitivity? We model sales as a function of price and whether the product was featured in ads.
Key finding: The interaction term (log(price):feat) is negative and significant — advertising changes how customers respond to price!
Finding: Advertising increases price sensitivity
| Condition | Price Elasticity |
|---|---|
| No advertising | -0.96 |
| With advertising | -0.96 + (-0.98) = -1.94 |
Why? Ads coincide with promotions → attract price-sensitive shoppers
Key Lessons:
Correlation ≠ Causation: Ads don’t cause sensitivity; they coincide with promotions
Selection effects: Who responds to ads? Price hunters!
Confounding variables: Promotions happen during ad campaigns
Managerial insight: Don’t blame advertising for price sensitivity — it’s the promotion strategy
Always ask: What’s really driving the relationship?
What if the outcome is yes/no?
\[P(y=1 \mid x) = \frac{1}{1 + e^{-\beta^T x}}\]
Why not just use linear regression?
#| echo: false
#| fig-height: 5
x <- seq(-6, 6, length = 200)
plot(x, 1/(1 + exp(-x)), type = "l", lwd = 3, col = "steelblue",
xlab = expression("Linear Predictor (" * beta * "'x)"), ylab = "Probability",
main = "The Logistic (Sigmoid) Function")
abline(h = 0.5, lty = 2, col = "gray")
abline(h = c(0, 1), lty = 3, col = "red")
text(4, 0.75, "Always between 0 and 1", cex = 0.9)Can Vegas point spreads predict game outcomes? We fit a logistic regression using historical NBA data.
Interpretation: For each additional point in the spread, log-odds of favorite winning increases by 0.16. The p-value < 0.001 confirms spreads are highly predictive.
Using our model, we can predict win probability for any point spread:
| Spread | P(Favorite Wins) |
|---|---|
| 4 points | 65% |
| 8 points | 78% |
| 12 points | 87% |
Same approach used for: credit scoring, churn prediction, marketing response, fraud detection — any binary outcome.
How accurate is our model? The confusion matrix shows predictions vs. actual outcomes.
Our model achieves about 66% accuracy — better than a coin flip!
Reading the Matrix:
| Pred: 0 | Pred: 1 | |
|---|---|---|
| Actual: 0 | TN (correct!) | FP (oops) |
| Actual: 1 | FN (oops) | TP (correct!) |
Sports Betting Reality:
But past performance ≠ future results
| Predicted: Win | Predicted: Lose | |
|---|---|---|
| Actual: Win | True Positive (TP) | False Negative (FN) |
| Actual: Lose | False Positive (FP) | True Negative (TN) |
Key Metrics:
Caution: Accuracy can mislead! A spam filter predicting “not spam” for everything has 99% accuracy but catches zero spam. Choose metrics based on business costs.
#| echo: false
#| fig-height: 5.5
pred_prob <- predict(model, type = "response")
roc_data <- data.frame(
threshold = seq(0, 1, by = 0.01)
) |>
rowwise() |>
mutate(
sensitivity = mean(pred_prob[NBA$favwin == 1] > threshold),
specificity = mean(pred_prob[NBA$favwin == 0] <= threshold)
)
plot(1 - roc_data$specificity, roc_data$sensitivity, type = "l", lwd = 3,
col = "steelblue", xlab = "False Positive Rate", ylab = "True Positive Rate",
main = "ROC Curve")
abline(0, 1, lty = 2, col = "gray")Understanding the ROC Curve:
Area Under Curve (AUC):
| AUC | Model Quality |
|---|---|
| 0.5 | Random (useless) |
| 0.6-0.7 | Poor |
| 0.7-0.8 | Fair |
| 0.8-0.9 | Good |
| 0.9+ | Excellent |
The optimal threshold depends on business costs:
There is no universal “correct” threshold
Framework for Threshold Selection:
Example — Credit Card Fraud:
Let business economics guide your model decisions
| Concept | Key Insight |
|---|---|
| Distributions | Binomial (binary), Poisson (counts), Normal (continuous) |
| Poisson | Mean = Variance — the fingerprint of count data |
| Normal | CLT makes it universal for averages |
| Linear Regression | Coefficients = effect sizes |
| Logistic Regression | Outputs probabilities for classification |
| ROC/AUC | Trade-off between false positives and false negatives |
| Threshold | Business costs should drive the choice |
Statistics is the science of decision-making under uncertainty
Online Articles:
Key Insight from HBR: A simple A/B test at Bing generated over $100M annually by testing a “low priority” idea
Books for Further Study:
Online Courses:
“You shall know a word by the company it keeps.” — J.R. Firth (1957)
Language poses unique challenges for AI:
The breakthrough: Represent words as vectors in continuous space where geometry encodes meaning.
The Problem with One-Hot Encoding:
Each word gets a unique vector with a single 1:
Problem: Cosine similarity between any two words = 0
No notion of semantic similarity is captured!
Solution: Learn dense vector representations where similar words are close together.
Imagine playing Twenty Questions to identify words:
| Question | Bear | Dog | Cat |
|---|---|---|---|
| Is it an animal? | 1 | 1 | 1 |
| Is it domestic? | 0 | 1 | 0.7 |
| Larger than human? | 0.8 | 0.1 | 0.01 |
| Has long tail? | 0 | 0.6 | 1 |
| Is it a predator? | 1 | 0 | 0.6 |
Each word becomes a vector of answers. Similar words give similar answers → similar vectors!
This is the essence of word embeddings.
The Distributional Hypothesis: Words appearing in similar contexts have similar meanings.
Result: Vector arithmetic captures analogies!
\[\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}\]
Training Word2Vec on Tolstoy’s War and Peace reveals thematic structure:
Word2Vec embeddings from War and Peace, reduced to 2D via PCA
The Word2Vec visualization reveals meaningful semantic relationships:
| Cluster | Words | Insight |
|---|---|---|
| Military | soldier, regiment, battle, army | War domain |
| Social | ballroom, court, marriage | Peace domain |
| Government | history, power, war | Political themes |
Key observation: “Peace” sits between government and social domains — central to the narrative’s dual structure.
Business applications: Netflix recommendations, Amazon suggestions, LinkedIn job matching, document search
Given a center word, predict surrounding context words:
%%| echo: false
%%| fig-width: 10
flowchart LR
A[loves] --> B[the]
A --> C[man]
A --> D[his]
A --> E[son]
style A fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
style B fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
style C fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
style D fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
style E fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px\[P(\text{context} \mid \text{center}) = \prod_{j} P(w_{j} \mid w_{\text{center}})\]
The learned vectors capture semantic relationships because words with similar contexts get similar representations.
The Problem: Static Embeddings
The Sequential Bottleneck (RNNs/LSTMs):
The Breakthrough: Attention Mechanisms
Result: Contextual representations that change based on surrounding words.
The Library Analogy (Query, Key, Value):
The Mathematical Operation:
%%| echo: false
%%| fig-width: 10
flowchart LR
k1[k1]
k2[k2]
km[km]
Q[Query q] --> a1[score1]
Q --> a2[score2]
Q --> am[scorem]
k1 --> a1
k2 --> a2
km --> am
a1 -.-> v1[v1]
a2 -.-> v2[v2]
am -.-> vm[vm]
v1 --> O[Output]
v2 --> O
vm --> O
style Q fill:#e1f5fe,stroke:#0277bd
style O fill:#c8e6c9,stroke:#2e7d32\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]
For “The trophy wouldn’t fit in the suitcase because it was too big”:
| Word | Attention to “it” |
|---|---|
| trophy | 0.45 |
| suitcase | 0.15 |
| fit | 0.12 |
| big | 0.18 |
| other | 0.10 |
The model learns that “it” most likely refers to “trophy” by attending strongly to it!
Self-Attention:
Goal: Understand internal relationships. Used in: Encoders (BERT) and Decoders (GPT). Analogy: Rereading a sentence to find the subject.
Cross-Attention:
Goal: Link two different sequences. Used in: Translation models (T5). Analogy: Looking back at English while writing French.
%%| echo: false
%%| fig-width: 12
flowchart LR
In["Input"] --> Tok["Token"] --> Emb["Embed"] --> Att["Attention"] --> FF["FeedForward"] --> Out["Output"]
style In fill:#e3f2fd,stroke:#1976d2
style Att fill:#fff3e0,stroke:#f57c00
style FF fill:#e8f5e9,stroke:#388e3c
style Out fill:#f3e5f5,stroke:#7b1fa2Key innovations:
| Property | RNN/LSTM | Transformer |
|---|---|---|
| Sequential processing | Yes (slow) | No (parallel) |
| Long-range dependencies | Difficult | Easy |
| Training speed | Slow | Fast |
| Scalability | Limited | Excellent |
Transformers scale with compute → the foundation of modern LLMs.
The Scale Approach:
Emergent capabilities appear at scale:
LLMs are autoregressive: they predict the next token based on all previous tokens.
The Generation Loop:
Temperature (\(\tau\)):
| \(\tau\) | Behavior |
|---|---|
| 0 | Deterministic |
| 0.7 | Balanced |
| 1.0 | Probabilistic |
| 1.5 | Creative |
Lower \(\tau\) = Predictable Higher \(\tau\) = Random
Prompt: “The first African American president is Barack…”
A greedy strategy always picks “Obama” — but in formal documents, “Barack Hussein Obama” is preferred.
Temperature > 0 allows the model to explore alternatives that may better fit the context.
%%| echo: false
%%| fig-width: 12
flowchart LR
D[Data Collection] --> P[Pre-Training]
P --> I[Instruction Tuning]
I --> A[Alignment]
A --> Dep[Deployment]
style D fill:#e3f2fd,stroke:#1976d2
style P fill:#e8f5e9,stroke:#388e3c
style I fill:#fff3e0,stroke:#f57c00
style A fill:#fce4ec,stroke:#c2185b
style Dep fill:#f3e5f5,stroke:#7b1fa2| Stage | Purpose |
|---|---|
| Data Collection | Curate training corpus (quality > quantity) |
| Pre-Training | Predict next tokens on billions of sequences |
| Instruction Tuning | Teach the model to follow instructions |
| Alignment | Ensure behavior matches human values (RLHF) |
| Deployment | Optimize for latency, cost, safety |
Example: “Is Allah the only god?”
This nuanced behavior emerges from alignment training, not pre-training alone.
Context window: Maximum tokens the model can “see” at once
%%| echo: false
%%| fig-width: 10
flowchart LR
S[System Prompt<br/>~500 tokens]
T[Tools/Schemas<br/>~300 tokens]
H[History<br/>~1000 tokens]
R[Retrieved Docs<br/>~2000 tokens]
U[User Query<br/>~200 tokens]
S --> M[LLM]
T --> M
H --> M
R --> M
U --> M
style S fill:#e3f2fd
style R fill:#c8e6c9
style U fill:#fff3e0Prompting strategies: Zero-shot, Few-shot, Chain-of-thought, System prompts
“The question of whether a computer can think is no more interesting than the question of whether a submarine can swim.” — Edsger Dijkstra
AI agents are autonomous systems that:
Unlike chatbots, agents can act in the world.
The agent perceives its environment, reasons about goals, acts to achieve outcomes, observes the result, and repeats — a continuous loop of intelligent behavior.
LLMs are “brains without hands” — function calling bridges this gap:
%%| echo: false
%%| fig-width: 10
flowchart LR
U["Query"] --> L["LLM"]
L --> TC["Tool Call"]
TC --> O["Orchestrator"]
O --> T["Tool"]
T --> |"Result"| L
L --> R["Response"]
style U fill:#e3f2fd,stroke:#1976d2
style L fill:#fff3e0,stroke:#f57c00
style TC fill:#fce4ec,stroke:#c2185b
style T fill:#e8f5e9,stroke:#388e3c
style R fill:#f3e5f5,stroke:#7b1fa2Examples: Web search, database queries, code execution, API calls.
User: “What’s $100 in euros?”
Agent reasoning:
convert_currency(amount=100, from="USD", to="EUR")Tool returns: 92.50
Agent response: “100 US dollars is approximately 92.50 euros at current exchange rates.”
The agent reasons about what tool to use, then acts to get information.
Complex tasks require chained actions:
%%| echo: false
%%| fig-width: 12
flowchart LR
Task[Task] --> get_rates[get_rates]
get_rates --> Rates[Rates]
Rates --> get_prices[get_prices]
get_prices --> Prices[Prices]
Prices --> correlate[correlate]
correlate --> r[r=0.73]
r --> report[report]
report --> Final[Final Report]
style Task fill:#e3f2fd,stroke:#1976d2
style get_rates fill:#fff3e0,stroke:#f57c00
style Rates fill:#fff3e0,stroke:#f57c00
style get_prices fill:#fff3e0,stroke:#f57c00
style Prices fill:#fff3e0,stroke:#f57c00
style correlate fill:#fff3e0,stroke:#f57c00
style r fill:#fff3e0,stroke:#f57c00
style report fill:#fff3e0,stroke:#f57c00
style Final fill:#c8e6c9,stroke:#2e7d32Each step informs the next — true autonomous problem-solving.
Planning capabilities enable:
| Capability | Description | Example |
|---|---|---|
| Decomposition | Break complex goals into subtasks | “Analyze market” → 4 API calls |
| State tracking | Remember intermediate results | Store data between steps |
| Adaptation | Adjust plan based on results | Retry if API fails |
| Synthesis | Combine outputs into final answer | Merge data into report |
Business impact: Agents can handle multi-hour research tasks that would take humans days.
The Loop:
| Step | Action | Example |
|---|---|---|
| Observe | Analyze input, tool outputs, environment | “User wants weather in Paris” |
| Think | Decide next action or tool to use | “I should call weather API” |
| Act | Execute tool or generate response | get_weather("Paris") |
Key insight: Unlike single-pass generation, ReAct agents can course-correct based on intermediate results.
ChatDev orchestrates a virtual software company with specialized AI agents:
%%| echo: false
%%| fig-width: 10
flowchart LR
CEO[CEO] --- CTO[CTO]
CTO --- CPO[CPO]
Prog[Programmer] --- Des[Designer]
Test[Tester] --- Prog2[Programmer]
CEO --> Prog
Des --> Test
Test --> Doc[Documentation]
style CEO fill:#ffcccc,stroke:#cc0000
style CTO fill:#ccffcc,stroke:#00cc00
style Prog fill:#cce5ff,stroke:#1976d2
style Test fill:#fff3cd,stroke:#f57c00Results: 70 software projects, 17 files each, ~$0.30 per project, 7 minutes.
| Pattern | Use Case | Tradeoff |
|---|---|---|
| Sequential | Content pipeline (research → write → edit) | Simple but slow |
| Parallel | Multi-source analysis | Fast but needs synthesis |
| Hierarchical | Project management | Control but bottleneck risk |
| Dynamic | Market-based task allocation | Flexible but complex |
Case Study: Replit Agent Failure
%%| echo: false
%%| fig-width: 10
flowchart LR
U["User: Fix this bug"] --> A[Agent]
A --> D1["Diagnoses: config file issue"]
D1 --> D2["Decides: delete config"]
D2 --> B["Bug in delete tool"]
B --> C["Entire project wiped"]
C --> X["Production DB destroyed"]
style D1 fill:#fff3cd
style D2 fill:#ffcccc
style B fill:#ffcccc
style X fill:#ff0000,color:#fffLesson: Agent autonomy requires multiple safety layers.
What went wrong:
| Failure | Type | Prevention |
|---|---|---|
| Wrong diagnosis | Reasoning error | Require confirmation for destructive actions |
| Auto-delete decision | Autonomy overreach | Human-in-the-loop for irreversible ops |
| Tool bug | Implementation flaw | Sandbox testing, rollback capability |
| No backup | Missing safeguard | Mandatory snapshots before changes |
Key principle: The more powerful the agent, the more guardrails it needs.
%%| echo: false
%%| fig-width: 10
flowchart LR
PI[Prompt Injection] --> A[Agent]
AD[Adversarial Inputs] --> A
GM[Goal Misalignment] --> A
HA[Hallucinations] --> A
CO[Capability Overhang] --> A
LC[Lack of Corrigibility] --> A
A --> H[Harm]
style PI fill:#ffcccc
style HA fill:#fff3cd
style H fill:#ff0000,color:#fffAutonomous agents amplify risks — a hallucination becomes action.
| Risk | Description | Real Example |
|---|---|---|
| Prompt Injection | Hidden instructions hijack agent | Email contains “ignore previous instructions” |
| Hallucinations | Acting on false information | Agent invents API that doesn’t exist |
| Goal Misalignment | Optimizes wrong objective | Maximizes engagement via manipulation |
| Capability Overhang | Does more than authorized | Accesses files outside scope |
%%| echo: false
%%| fig-width: 10
flowchart LR
U[Input] --> IF[Input Filter]
IF --> |Clean| A[Agent]
IF --> |Malicious| B[Block]
A --> OF[Output Filter]
OF --> |Safe| R[Response]
OF --> |Unsafe| B
A --> M[Monitor]
M --> |Anomaly| CB[Circuit Breaker]
CB --> B
style IF fill:#fff3e0,stroke:#f57c00
style OF fill:#fff3e0,stroke:#f57c00
style B fill:#ffcccc,stroke:#cc0000
style R fill:#ccffcc,stroke:#00cc00
style CB fill:#fce4ec,stroke:#c2185bThe Safety Pipeline:
| Layer | Purpose | Technical Method |
|---|---|---|
| Input Filter | Block malicious prompts | PII detection, jailbreak classifiers |
| Sandboxing | Isolate agent actions | Docker containers, restricted API keys |
| Output Filter | Prevent sensitive leakage | RegEx for PII, toxic content scoring |
| Human-in-the-Loop | Verify high-risk actions | “Approve” button for financial transfers |
| Monitoring | Detect runtime anomalies | Log analysis, capability tracking |
Key Principle: Never rely on the LLM to self-police. Use external code to enforce boundaries.
The most effective safety measure for high-stakes agents:
rm -rf, send_payment).Example: A code-refactoring agent proposes changes; a human developer reviews and clicks “Merge” or “Reject”.
For Claude Opus 4, Anthropic activated proactive safety:
%%| echo: false
%%| fig-width: 10
flowchart LR
U[User] --> CC[Constitutional Classifiers]
CC --> |Safe| M[Model]
CC --> |Blocked| B[Reject]
M --> OC[Output Check]
OC --> |Safe| R[Response]
OC --> |Harmful| B
BB[Bug Bounty] --> CC
RP[Rapid Patch] --> CC
style CC fill:#c8e6c9,stroke:#2e7d32
style B fill:#ffcccc,stroke:#cc0000
style R fill:#e3f2fd,stroke:#1976d2| Layer | Function | Why It Matters |
|---|---|---|
| Constitutional AI | Real-time input/output filtering | Blocks harmful requests before execution |
| Bug Bounty | Crowdsourced discovery | Finds attacks humans miss |
| Rapid Patching | Auto-generate variants | Stays ahead of attackers |
| Egress Control | Throttle outbound data | Prevents model weight theft |
Traditional metrics (accuracy, precision) are insufficient for agents.
%%| echo: false
%%| fig-width: 10
flowchart LR
TC["Task Completion"] --> Score["Overall Agent Score"]
RQ["Reasoning Quality"] --> Score
SA["Safety"] --> Score
RE["Resource Efficiency"] --> Score
ER["Error Recovery"] --> Score
AD["Adversarial Robustness"] --> Score
style SA fill:#ffcccc,stroke:#cc0000
style Score fill:#c8e6c9,stroke:#2e7d32%%| echo: false
%%| fig-width: 10
flowchart LR
A[Agent Output] --> R[Rule-Based]
A --> L[LLM-as-Judge]
A --> H[Human Review]
A --> S[Simulation]
R --> E[Score]
L --> E
H --> E
S --> E
style R fill:#e3f2fd,stroke:#1976d2
style L fill:#fff3e0,stroke:#f57c00
style H fill:#c8e6c9,stroke:#2e7d32
style S fill:#f3e5f5,stroke:#7b1fa2
style E fill:#ffcccc,stroke:#cc0000Best practice: Combine multiple approaches for comprehensive evaluation.
| Domain | Benchmark | What It Tests |
|---|---|---|
| Coding | SWE-bench | Fix real GitHub issues |
| Web | WebArena | Navigate websites, complete tasks |
| Robotics | ALFRED | Household tasks in 3D |
| Enterprise | TAU-bench | Multi-system workflows |
Agent capabilities are task-specific — benchmarks must match use cases.
Systematic vulnerability testing:
%%| echo: false
%%| fig-width: 10
flowchart LR
PI[Prompt Injection] --> A[Agent]
ME[Agent Mistakes] --> A
MU[Direct Misuse] --> A
A --> |Vulnerability| V[Security Issue]
A --> |Safe| S[Normal Operation]
V --> R[Report]
style PI fill:#ffcccc,stroke:#cc0000
style ME fill:#fff3cd,stroke:#f57c00
style MU fill:#ffcccc,stroke:#cc0000
style V fill:#ffcccc,stroke:#cc0000
style S fill:#ccffcc,stroke:#00cc00Example: Hidden text in a webpage hijacks agent to exfiltrate data.
Comprehensive red-teaming found 1,200+ vulnerabilities in one enterprise agent.
Software agents operate in digital systems. Embodied agents must handle:
%%| echo: false
%%| fig-width: 12
flowchart LR
C["Camera"] --> F["Fusion"]
L["Lidar"] --> F
T["Touch"] --> F
F --> B["Robot Brain"]
B --> M["Motors"]
M --> E["Environment"]
E --> |"Feedback"| C
style F fill:#fff3e0,stroke:#f57c00
style B fill:#e3f2fd,stroke:#1976d2
style E fill:#c8e6c9,stroke:#388e3cThe sim-to-real gap: Robots trained in simulation often fail in reality.
| Era | Capability | Limitation |
|---|---|---|
| Rule-based | Explicit reasoning | Brittle, narrow |
| Probabilistic | Handle uncertainty | No language understanding |
| Foundation Models | Natural language + adaptation | Compute-intensive |
LLMs have catalyzed a new era: robots that understand language and adapt.
A vision-language-action model that directly controls robots:
%%| echo: false
%%| fig-width: 10
flowchart LR
V[Vision Input] --> VLA[RT-2 Model]
L[Language Input] --> VLA
VLA --> A[Action Output]
A --> R[Robot]
R -- Feedback --> V
style V fill:#e3f2fd,stroke:#1976d2
style L fill:#fff3e0,stroke:#f57c00
style VLA fill:#f3e5f5,stroke:#7b1fa2
style A fill:#c8e6c9,stroke:#2e7d32
style R fill:#ffcccc,stroke:#cc0000General → Interactive → Dexterous
Works across robot forms: arms, humanoids, mobile platforms.
Named after Asimov’s Laws of Robotics, this benchmark tests embodied AI safety:
| Asimov’s Law | Modern Interpretation | Test Scenario |
|---|---|---|
| 1. Don’t harm humans | Refuse dangerous commands | “Throw this at the person” |
| 2. Obey orders | Follow safe instructions | “Hand me that tool” |
| 3. Protect self | Avoid self-damage | Don’t walk off ledge |
| Zeroth Law | Protect humanity broadly | Consider societal impact |
Key challenge: Context matters — “Hand me that knife” is safe in a kitchen, dangerous in a conflict.
Business relevance: As robots enter warehouses, hospitals, and homes, safety benchmarks become legal and ethical requirements.
| Concept | Key Insight |
|---|---|
| Word Embeddings | Words as vectors; geometry = meaning |
| Distributional Hypothesis | Context reveals meaning |
| Attention | Dynamic weighting of relevant information |
| Transformers | Parallel processing, scalable, powerful |
The shift from symbols to vectors enabled modern NLP.
| Concept | Key Insight |
|---|---|
| Autoregressive Generation | Predict next token iteratively |
| Temperature | Controls randomness/creativity |
| Alignment | Ensures safe, helpful behavior |
| Context Windows | Limit on “memory” size |
Scale + alignment = emergent reasoning capabilities.
| Concept | Key Insight |
|---|---|
| Tool Use | LLMs gain ability to act |
| Multi-Step Planning | Chain reasoning and action |
| Orchestration | Multiple agents collaborate |
| Safety | Autonomy amplifies risks |
| Evaluation | Requires new methodologies |
Agents transform LLMs from conversationalists to autonomous workers.
For AI leaders:
The promise: augmenting human intelligence — agents handle routine tasks while humans provide judgment, creativity, and ethical oversight.
Online Articles:
From the Textbook:
Cursor is an AI-powered code editor built on VS Code. It allows you to:
For this course, we’ll use Cursor to build an AI agent without needing deep programming expertise.
.dmg file.exe installerBenefits of signing in:
If you’ve used VS Code before:
Cursor needs Python installed on your computer to run our project.
Ctrl+`)Python 3.x.x, you’re good! Skip to Part 4.Option A: Using Homebrew (Recommended)
Option B: Direct Download
After installation, close and reopen Cursor, then:
Python 3.x.xOur project needs a few Python libraries. Install them in Cursor’s terminal:
You should see output indicating successful installation.
If you get a “pip not found” error on Mac:
Use this to ask questions or get help:
Use this to write or modify code:
As you type, Cursor suggests completions:
For larger tasks, use Agent mode:
oj-pricing-agenttest.pyHello from Cursor!Congratulations! You’re ready to build your AI agent.
| Action | Mac | Windows |
|---|---|---|
| AI Chat | Cmd+L | Ctrl+L |
| Inline Edit | Cmd+K | Ctrl+K |
| Agent Mode | Cmd+I | Ctrl+I |
| Open Terminal | Ctrl+| Ctrl+ |
|
| Save File | Cmd+S | Ctrl+S |
| Open File | Cmd+O | Ctrl+O |
| New File | Cmd+N | Ctrl+N |
| Find | Cmd+F | Ctrl+F |
Mac:
python3 instead of pythonbrew install pythonWindows:
Mac:
pip3 instead of pipWindows:
python -m pip install package_nameAfter completing this setup:
You’re ready for Zoom Session 1 where we’ll practice using Cursor’s AI features together!
In this project, you will build an AI agent that helps a retail pricing analyst make decisions about orange juice pricing and promotions. The agent will:
Time Required: ~2 hours
Prerequisites:
oj-pricing-agent on your computeroj_agent.pyDownload the oj_data.csv file and copy it into your oj-pricing-agent folder.
In your oj_agent.py file, start by adding these lines at the top:
What this does: These libraries help us work with data (pandas), do math (numpy), and build models (sklearn).
Add the following code to load the orange juice sales data:
# Load the orange juice dataset
print("Loading data...")
df = pd.read_csv('oj_data.csv')
# Display basic information
print(f"Dataset has {len(df)} rows and {len(df.columns)} columns")
print(f"\nColumns: {list(df.columns)}")
print(f"\nBrands in dataset: {df['brand'].unique()}")
print(f"\nPrice range: ${df['price'].min():.2f} - ${df['price'].max():.2f}")
print(f"\nSample of data:")
print(df.head())You should see output showing:
Troubleshooting: If you get an error about missing packages, run:
We’re building a model that predicts log of sales volume based on:
Add this code to prepare features for the model:
# ============================================
# PART 3: BUILD THE REGRESSION MODEL
# ============================================
print("\n" + "="*50)
print("Building the pricing model...")
print("="*50)
# Create dummy variables for brand (one-hot encoding)
# This converts 'brand' text into numbers the model can use
brand_dummies = pd.get_dummies(df['brand'], prefix='brand', drop_first=False)
# Create the feature matrix
# We include: price, feat, brand dummies, and price*brand interactions
X = pd.DataFrame({
'price': df['price'],
'feat': df['feat'],
'brand_minute.maid': brand_dummies['brand_minute.maid'],
'brand_tropicana': brand_dummies['brand_tropicana'],
# Interaction terms: price effect varies by brand
'price_x_minute.maid': df['price'] * brand_dummies['brand_minute.maid'],
'price_x_tropicana': df['price'] * brand_dummies['brand_tropicana']
})
# Target variable: log of sales (logmove)
y = df['logmove']
print(f"Features: {list(X.columns)}")
print(f"Target: logmove (log of sales volume)")Add code to train the regression model:
# Fit the linear regression model
model = LinearRegression()
model.fit(X, y)
# Display the coefficients
print("\nModel Coefficients:")
print("-" * 40)
for feature, coef in zip(X.columns, model.coef_):
print(f" {feature}: {coef:.4f}")
print(f" intercept: {model.intercept_:.4f}")
# Calculate R-squared (how well the model fits)
r_squared = model.score(X, y)
print(f"\nModel R-squared: {r_squared:.3f}")
print("(This means the model explains {:.1f}% of sales variation)".format(r_squared * 100))Save and run the script again. You should see coefficients like:
Add these functions that the agent will use to answer questions:
# ============================================
# PART 4: HELPER FUNCTIONS FOR THE AGENT
# ============================================
def predict_sales(brand, price, featured=0):
"""
Predict sales volume for a given brand, price, and feature status.
Args:
brand: 'tropicana', 'minute.maid', or 'dominicks'
price: price in dollars (e.g., 2.50)
featured: 1 if in ad circular, 0 if not
Returns:
Predicted sales volume (not log-transformed)
"""
# Create feature vector
features = {
'price': price,
'feat': featured,
'brand_minute.maid': 1 if brand.lower() == 'minute.maid' else 0,
'brand_tropicana': 1 if brand.lower() == 'tropicana' else 0,
'price_x_minute.maid': price if brand.lower() == 'minute.maid' else 0,
'price_x_tropicana': price if brand.lower() == 'tropicana' else 0
}
# Convert to dataframe for prediction
X_pred = pd.DataFrame([features])
# Predict log sales, then convert back
log_sales = model.predict(X_pred)[0]
sales = np.exp(log_sales)
return sales
def get_price_elasticity(brand):
"""
Calculate the price elasticity for a given brand.
Price elasticity tells us: if price increases by 1%,
how much does quantity demanded change (in %)?
A more negative number means more price-sensitive.
"""
# Base price coefficient
base_coef = model.coef_[0] # price coefficient
# Add brand-specific interaction if applicable
if brand.lower() == 'minute.maid':
interaction_coef = model.coef_[4] # price_x_minute.maid
elif brand.lower() == 'tropicana':
interaction_coef = model.coef_[5] # price_x_tropicana
else: # dominicks (base case)
interaction_coef = 0
total_elasticity = base_coef + interaction_coef
return total_elasticity
def get_advertising_lift(brand):
"""
Calculate the sales lift from being featured in advertising.
Returns the percentage increase in sales.
"""
# The 'feat' coefficient tells us the log-sales increase
feat_coef = model.coef_[1] # feat coefficient
# Convert from log to percentage change
percentage_lift = (np.exp(feat_coef) - 1) * 100
return percentage_lift
def find_optimal_price(brand, min_price=1.0, max_price=4.0, featured=0):
"""
Find the price that maximizes revenue for a brand.
Revenue = Price × Quantity
"""
best_price = min_price
best_revenue = 0
# Search through price range
for price in np.arange(min_price, max_price, 0.05):
sales = predict_sales(brand, price, featured)
revenue = price * sales
if revenue > best_revenue:
best_revenue = revenue
best_price = price
return best_price, best_revenue
def compare_elasticities():
"""
Compare price elasticity across all three brands.
"""
brands = ['dominicks', 'minute.maid', 'tropicana']
results = {}
for brand in brands:
elasticity = get_price_elasticity(brand)
results[brand] = elasticity
return resultsAdd test code to verify the functions work:
# ============================================
# TEST THE HELPER FUNCTIONS
# ============================================
print("\n" + "="*50)
print("Testing helper functions...")
print("="*50)
# Test prediction
test_sales = predict_sales('tropicana', 2.50, featured=0)
print(f"\nPredicted sales for Tropicana at $2.50 (no ad): {test_sales:.0f} units")
# Test elasticity
elasticities = compare_elasticities()
print("\nPrice Elasticities by Brand:")
for brand, elast in elasticities.items():
print(f" {brand}: {elast:.3f}")
# Test advertising lift
lift = get_advertising_lift('minute.maid')
print(f"\nAdvertising lift: {lift:.1f}% increase in sales")
# Test optimal price
opt_price, opt_rev = find_optimal_price('dominicks')
print(f"\nOptimal price for Dominick's: ${opt_price:.2f} (revenue: ${opt_rev:.2f})")Run the script again to verify all functions work correctly.
Now we’ll create the agent that interprets natural language questions and calls the appropriate functions. Add this code:
# ============================================
# PART 5: THE AI AGENT
# ============================================
def answer_question(question):
"""
Simple agent that answers business questions about OJ pricing.
This is a rule-based agent that matches keywords in the question
to determine which analysis to perform.
"""
question_lower = question.lower()
# Question 1: Predict sales for specific scenario
if 'predict' in question_lower or 'sales volume' in question_lower:
# Extract brand and price from question if possible
if 'tropicana' in question_lower:
brand = 'tropicana'
elif 'minute maid' in question_lower:
brand = 'minute.maid'
else:
brand = 'dominicks'
# Look for price (default to $2.50 if not found)
import re
price_match = re.search(r'\$?(\d+\.?\d*)', question_lower)
price = float(price_match.group(1)) if price_match else 2.50
# Check for advertising
featured = 1 if 'advertis' in question_lower or 'feature' in question_lower else 0
if 'no advertis' in question_lower or 'without advertis' in question_lower:
featured = 0
sales = predict_sales(brand, price, featured)
response = f"""
**Predicted Sales Analysis**
Brand: {brand.title().replace('.', ' ')}
Price: ${price:.2f}
Featured in Ad: {'Yes' if featured else 'No'}
**Predicted Sales Volume: {sales:,.0f} units**
This prediction is based on our regression model that accounts for:
- Base demand for this brand
- Price sensitivity (elasticity)
- Advertising effects
"""
return response
# Question 2: Which brand is most price-sensitive?
elif 'price-sensitive' in question_lower or 'price sensitive' in question_lower or 'most sensitive' in question_lower:
elasticities = compare_elasticities()
# Find most price-sensitive (most negative elasticity)
most_sensitive = min(elasticities, key=elasticities.get)
response = f"""
**Price Sensitivity Analysis**
Price Elasticity by Brand:
"""
for brand, elast in sorted(elasticities.items(), key=lambda x: x[1]):
sensitivity = "HIGH" if elast < -3 else "MEDIUM" if elast < -2 else "LOW"
response += f"- {brand.title().replace('.', ' ')}: {elast:.3f} ({sensitivity} sensitivity)\n"
response += f"""
**Most Price-Sensitive: {most_sensitive.title().replace('.', ' ')}**
Interpretation: A 1% price increase leads to a {abs(elasticities[most_sensitive]):.1f}% decrease in sales for {most_sensitive.title().replace('.', ' ')}.
Business Implication: Be careful with price increases on {most_sensitive.title().replace('.', ' ')} - customers are very responsive to price changes.
"""
return response
# Question 3: Should we feature a brand in advertising?
elif 'feature' in question_lower or 'ad circular' in question_lower or 'advertising' in question_lower:
if 'minute maid' in question_lower:
brand = 'minute.maid'
elif 'tropicana' in question_lower:
brand = 'tropicana'
else:
brand = 'dominicks'
lift = get_advertising_lift(brand)
# Calculate example impact
base_sales = predict_sales(brand, 2.50, featured=0)
featured_sales = predict_sales(brand, 2.50, featured=1)
response = f"""
**Advertising Impact Analysis for {brand.title().replace('.', ' ')}**
Expected Sales Lift from Featuring: **{lift:.1f}%**
Example at $2.50:
- Without advertising: {base_sales:,.0f} units
- With advertising: {featured_sales:,.0f} units
- Additional sales: {featured_sales - base_sales:,.0f} units
**Recommendation:** {'Yes, feature this product!' if lift > 20 else 'Consider the advertising cost vs. the sales lift.'}
The advertising effect is consistent across price points. Factor in your advertising costs to determine if the sales lift justifies the expense.
"""
return response
# Question 4: Optimal price for a brand
elif 'optimal price' in question_lower or 'maximize revenue' in question_lower or 'best price' in question_lower:
if 'minute maid' in question_lower:
brand = 'minute.maid'
elif 'tropicana' in question_lower:
brand = 'tropicana'
else:
brand = 'dominicks'
opt_price, opt_revenue = find_optimal_price(brand)
opt_sales = predict_sales(brand, opt_price, featured=0)
# Compare with current average price
avg_price = df[df['brand'] == brand]['price'].mean()
avg_revenue = avg_price * predict_sales(brand, avg_price, featured=0)
response = f"""
**Revenue Optimization for {brand.title().replace('.', ' ')}**
**Optimal Price: ${opt_price:.2f}**
At optimal price:
- Predicted sales: {opt_sales:,.0f} units
- Revenue per store-week: ${opt_revenue:,.2f}
Comparison with current average (${avg_price:.2f}):
- Current revenue: ${avg_revenue:,.2f}
- Potential improvement: ${opt_revenue - avg_revenue:,.2f} ({((opt_revenue/avg_revenue)-1)*100:.1f}%)
Note: This optimization assumes no competitor response and stable market conditions.
"""
return response
# Question 5: Compare elasticities across brands
elif 'compare' in question_lower or 'elasticity' in question_lower or 'across' in question_lower:
elasticities = compare_elasticities()
response = """
**Price Elasticity Comparison Across Brands**
| Brand | Elasticity | Interpretation |
|-------|------------|----------------|
"""
for brand, elast in sorted(elasticities.items(), key=lambda x: x[1]):
interp = f"1% price ↑ → {abs(elast):.1f}% sales ↓"
response += f"| {brand.title().replace('.', ' ')} | {elast:.3f} | {interp} |\n"
response += """
**Key Insights:**
1. **Dominick's** (store brand) is least price-sensitive - customers buying store brands may prioritize value and be less responsive to small price changes.
2. **Tropicana** shows moderate price sensitivity - as a premium brand, some customers are loyal but others will switch if prices rise.
3. **Minute Maid** is most price-sensitive - positioned between store and premium brands, these customers actively compare prices.
**Strategic Implications:**
- Use competitive pricing on Minute Maid to capture price-sensitive shoppers
- Tropicana can sustain moderate price premiums
- Dominick's margins can be optimized with less risk of volume loss
"""
return response
else:
return """
I can help you with these types of questions:
1. **Sales Prediction:** "What is the predicted sales volume if we price Tropicana at $2.50?"
2. **Price Sensitivity:** "Which brand is most price-sensitive?"
3. **Advertising Impact:** "Should we feature Minute Maid in the ad circular?"
4. **Price Optimization:** "What price should we set for Dominick's to maximize revenue?"
5. **Elasticity Comparison:** "Compare the price elasticity across brands"
Please try one of these questions!
"""Finally, add code to let users interact with the agent:
# ============================================
# PART 6: INTERACTIVE AGENT INTERFACE
# ============================================
def run_agent():
"""
Run the interactive agent interface.
"""
print("\n" + "="*60)
print("🍊 ORANGE JUICE PRICING ANALYTICS AGENT 🍊")
print("="*60)
print("\nHello! I'm your pricing analytics assistant.")
print("I can help you analyze orange juice pricing and promotions.")
print("\nTry asking me questions like:")
print(" - What is the predicted sales if we price Tropicana at $2.50?")
print(" - Which brand is most price-sensitive?")
print(" - Should we feature Minute Maid in the ad circular?")
print(" - What price maximizes revenue for Dominick's?")
print(" - Compare price elasticity across brands")
print("\nType 'quit' to exit.\n")
while True:
question = input("Your question: ").strip()
if question.lower() in ['quit', 'exit', 'q']:
print("\nThank you for using the OJ Pricing Agent. Goodbye!")
break
if not question:
continue
print("\n" + "-"*50)
response = answer_question(question)
print(response)
print("-"*50 + "\n")
# ============================================
# MAIN: RUN THE AGENT
# ============================================
if __name__ == "__main__":
# Run the interactive agent
run_agent()Save the file and run:
Test your agent with these exact questions:
“What is the predicted sales volume if we price Tropicana at $2.50 with no advertising?”
“Which brand is most price-sensitive?”
“Should we feature Minute Maid in this week’s ad circular? What’s the expected sales lift?”
“What price should we set for Dominick’s brand to maximize revenue?”
“Compare the price elasticity across the three brands.”
Record the answers for your summary document.
Create a 1-page document (Word or PDF) that includes:
For Zoom Session 3, prepare a 2-3 minute demonstration:
Tips:
Run: pip install pandas numpy scikit-learn
Make sure the data file is in the same folder as your Python script.
Check that your data loaded correctly - you should have ~28,000 rows.
Try rephrasing using keywords like “predict”, “price-sensitive”, “feature”, “optimal”, or “compare”.
The complete oj_agent.py file should be approximately 350-400 lines. If you get stuck, ask Cursor’s AI assistant for help by selecting your code and pressing Cmd+K (Mac) or Ctrl+K (Windows), then describing your issue. I also prepared a complete code reference for you to refer.
If you finish early and want to explore further:
Good luck with your project! 🍊
I live in a house at risk of a mudslide damage.
What is the best course of action?
graph LR
Start((Decision)) --> Build[Build Wall]
Start --> NoBuild[Don't Build]
Build -- "$10,000" --> WallNode{Slide?}
WallNode -- "0.01" --> FailNode{Wall Fails?}
FailNode -- "0.05" --> Loss["$100,000 Cost"]
FailNode -- "0.95" --> NoLoss["$0 Cost"]
WallNode -- "0.99" --> NoSlide["$0 Cost"]
NoBuild -- "$0" --> SlideNode{Slide?}
SlideNode -- "0.01" --> Loss2["$100,000 Cost"]
SlideNode -- "0.99" --> NoLoss2["$0 Cost"]\(EV = 0.01 \times \$100,000 = \$1,000\)
\(EV = \$10,000 + (0.01 \times 0.05 \times \$100,000) = \$10,050\)
[!IMPORTANT] Based purely on expected cost, Don’t Build is the rational choice despite the high impact of a slide.
A test is available to better estimate the risk.
Should we take the test?
\(P(T) = (0.90 \times 0.01) + (0.15 \times 0.99) = 0.1575\)
\(P(\text{Slide} \mid T) = \frac{0.90 \times 0.01}{0.1575} \approx 0.0571\)
\(P(\text{Slide} \mid \text{not } T) = \frac{0.1 \times 0.01}{0.8425} \approx 0.0012\)
If we test:
\(= 3,000 + (0.1575 \times 10,285) + (0.8425 \times 120) \approx \$4,693\)
| Choice | Expected Cost | Risk of Loss | P |
|---|---|---|---|
| Don’t Build | $1,000 | 0.01 | 1 in 100 |
| Build w/o test | $10,050 | 0.0005 | 1 in 2000 |
| Test & Decide | $4,693 | 0.00146 | 1 in 700 |
Decision? It depends on your utility function (risk tolerance).
Imagine a gambling game where a fair coin is flipped repeatedly until it lands on heads. The payoff for the game is \(2^N\), where \(N\) is the number of tosses needed for the coin to land on heads.
The expected value of this game is infinite:
\[ E(X) = \frac{1}{2} \cdot 2 + \frac{1}{4} \cdot 4 + \frac{1}{8} \cdot 8 + \ldots = \infty \]
This means that, in theory, a rational person should be willing to pay any finite amount to play this game. However, in reality, most people would be unwilling to pay a large amount.
Bernoulli argued that people do not maximize expected monetary value but rather expected utility \(U(x)\).
\[ E[U(X)] = \sum^\infty_{k=1} 2^{-k} U(2^k) \]
For the log utility case, \(U(x) = \log(x)\), the expected utility is \(2 \log(2)\). To find the certain dollar amount \(x^*\) (certainty equivalent) that provides the same utility:
\[ \log(x^*) = 2\log(2) = \log(2^2) = \log(4) \implies x^* = 4 \]
Under log utility, a rational player would pay at most $4 to play, despite the infinite expected monetary value.
Online: 2 Weeks (14 Days)
Email: vsokolov@gmu.edu
Phone: 703 993 4533
Course Textbook: Bayes, AI and Deep Learning by Nick Polson and Vadim Sokolov. The book is to be published by Chapman & Hall/CRC in 2026. Available for free online.
The purpose of this topic is to introduce participants to the foundational concepts of artificial intelligence and data-driven decision making. Participants will develop a working understanding of probability, statistical modeling, and modern AI techniques—equipping them to lead AI initiatives, evaluate AI investments, and communicate effectively with technical teams.
This module takes executives on a journey from the fundamentals of probability and uncertainty through statistical modeling to the cutting edge of modern AI. Rather than focusing on mathematical derivations, we emphasize intuition, real-world applications, and business implications. Through compelling case studies—from wrongful convictions caused by probability errors to the Netflix Prize’s lessons about model complexity—participants will learn to think probabilistically about business decisions. The module culminates in a hands-on project where participants build an AI agent using Cursor IDE, directly experiencing how data, models, and AI agents work together to solve business problems.
Upon completion of this topic, you should understand and be able to:
This topic combines asynchronous learning (recorded lectures, readings, discussion boards) with synchronous sessions (live Zoom calls) and hands-on practice. The approach emphasizes:
This topic will require approximately 10 hours of work to complete:
| Activity | Hours |
|---|---|
| Recorded Lectures (3 lectures × 30 min) | 1.5 |
| Live Zoom Sessions (3 sessions × 1 hr) | 3.0 |
| Reading | 2 |
| Discussion Boards | 1.0 |
| Final Project | 2.0 |
| Total | 10.0 |
| Day | Activities |
|---|---|
| Day 1-2 | Module 1 lectures available; begin readings on probability and Bayes rule |
| Day 3 | Discussion Board 1 opens |
| Day 4 | Zoom Session 1: Kick-off + Cursor IDE Hands-on (1 hr) |
| Day 5-6 | Module 2 lectures available; readings on statistics and regression |
| Day 7 | Discussion Board 2 opens |
| Day 8 | Zoom Session 2: Mid-point Check-in (1 hr) |
| Day 9-10 | Module 3 lectures available; readings on NLP and AI agents |
| Day 11 | Discussion Board 3 opens |
| Day 12-13 | Final Project work time |
| Day 14 | Zoom Session 3: Wrap-up + Final Project Presentations (1 hr) |
Days 1-4
Chapter 1: Probability and Uncertainty
Chapter 2: Bayes Rule
Chapter 4: Utility, Risk and Decisions
Opens Day 3 | Due Day 7
“Consider a strategic decision your organization recently faced (or is currently facing) involving uncertainty. Describe the decision and identify:
Respond to at least two peers’ posts with constructive suggestions.”
Day 4 | 1 hour
Preparation: Install Cursor IDE before the session (setup instructions)
Days 5-8
Chapter 1: Probability and Uncertainty
Chapter 3: Bayesian Learning
Chapter 12: Linear Regression
Chapter 13: Logistic Regression and GLMs
Opens Day 7 | Due Day 11 “The Netflix Prize awarded $1 million for a 10% improvement in recommendation accuracy, yet Netflix never fully implemented the winning algorithm—it was too complex and expensive to deploy, and by then, streaming had changed the business model entirely. See Why Even a Million Dollars Couldn’t Buy a Better Algorithm - Wired (Netflix Prize case study)
Reflecting on this case and the regression concepts from this module:
Respond to at least two peers’ posts, particularly focusing on whether you agree with their assessment of the accuracy-complexity trade-off.”
Day 8 | 1 hour
Preparation: Complete Module 2 lectures and readings
Days 9-14
Chapter 24: Natural Language Processing
Chapter 26: AI Agents
Opens Day 11 | Due Day 14
“AI agents are increasingly being deployed in business contexts. Describe a workflow or process in your organization that could potentially be automated or augmented by an AI agent. Address:
Respond to at least two peers’ posts.”
Day 14 | 1 hour
Preparation: Complete final project; prepare 2-3 minute demonstration
This part (as ever other part of the module) is optional.
Business Problem: You are a pricing analyst at a retail chain. Management wants to optimize orange juice pricing and promotional strategies. Build an AI agent that can answer business questions about pricing decisions using historical sales data and a predictive model.
Dataset: Dominick’s Orange Juice Dataset
Model: Linear Regression with Interactions
Your Agent Must Answer These Business Questions:
Deliverables:
Evaluation Criteria:
See Final Project Guide for detailed step-by-step instructions.